Data Containers

A place for everything

Rodney J. Dyer

Vectors

A Vector

A vector is a storage container for data that all share the same data type.

x <- c( 1, 2, 3, 4 )

All programmers are lazy and know that the fewer key presses you have to make, the less likely an error will be introduced. So the function named combine is represented as c().

x
[1] 1 2 3 4

Its job is to smush a bunch of data into a simple container.

Accessing Elements

A vector contains data of a single type, and each element can be accessed using numerical indices enclosed in square brackets [ and ].

x[2]
[1] 2
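
Indices can themselves be vectors, and they can be negative. Here is a small sketch, assuming the same x as above; the names first_two and without_first are made up for illustration.

```r
x <- c( 1, 2, 3, 4 )

# R counts from 1, so the first element is x[1]
x[1]

# A vector of indices returns several elements at once
first_two <- x[ c(1, 2) ]

# A negative index drops that element instead of selecting it
without_first <- x[-1]

# Assigning through an index changes the value in place
x[4] <- 40
x
```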

Introspection

Because x is a vector AND it contains numeric data, the introspection operators for both vector and numeric will return TRUE.

 

is.vector( x )
[1] TRUE
is.numeric( x )
[1] TRUE

 

So x is both a vector AND numeric.

Other Data Types

Vectors can hold other base data types as well, as long as every element is of the exact same type.

c("A", "B", "C", "The Cat jumped over the moon")
[1] "A"                            "B"                           
[3] "C"                            "The Cat jumped over the moon"
c( TRUE, FALSE, FALSE, TRUE)
[1]  TRUE FALSE FALSE  TRUE

No Mixing Allowed

You CANNOT mix data types in a single vector and keep each value's original type. R will coerce the values to the most general type present so that they are all of the same type.

c( 1, TRUE, FALSE, 23)
[1]  1  1  0 23

 

c( 1, TRUE, "FALSE", 23)
[1] "1"     "TRUE"  "FALSE" "23"   
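
If you are unsure what R decided to do, class() will tell you. A quick sketch of the two examples above (the name mixed is only for illustration):

```r
# class() reports the type R settled on after coercion
class( c( TRUE, FALSE ) )            # all logical, nothing to coerce
class( c( 1, TRUE, FALSE, 23 ) )     # logicals are turned into numbers
class( c( 1, TRUE, "FALSE", 23 ) )   # everything is turned into text

# Once coerced, the original type is gone; converting back is an explicit step
mixed <- c( 1, TRUE, "FALSE", 23 )
as.numeric( mixed[1] )
```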

Sequences

Sometimes it is helpful to make a sequence of values in a vector. R has some built-in functionality for that.

 

Sequence Operator

w <- 1:6
w
[1] 1 2 3 4 5 6

The seq() function

x <- seq(10,30, by=3)
x
[1] 10 13 16 19 22 25 28

 

LETTERS

y <- LETTERS[1:5]
y
[1] "A" "B" "C" "D" "E"

The seq() function (again)

z <- seq(10,30, length.out = 6)
z
[1] 10 14 18 22 26 30
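
The by and length.out arguments are two ways of describing the same thing. As a rough sketch (the names by_length, step, and by_step are made up here), the step size implied by length.out is the range divided by the number of gaps:

```r
# Six values from 10 to 30 leave five gaps between them
by_length <- seq( 10, 30, length.out = 6 )
step      <- ( 30 - 10 ) / ( 6 - 1 )

# Asking for that step size directly gives the same sequence
by_step <- seq( 10, 30, by = step )
all.equal( by_length, by_step )

# seq_len() is a related base R helper that counts from 1 up to n
seq_len( 6 )
```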

Vector Operators

Data within vectors can be subjected to unary operators.

 

-z
[1] -10 -14 -18 -22 -26 -30

 

!z
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Vector Operators

Vectors also work with binary operators, which are applied element by element.

 

w + z 
[1] 11 16 21 26 31 36
z^w
[1]        10       196      5832    234256  11881376 729000000

Recycling Rule

If you apply a binary operator to two vectors whose lengths are different, R will recycle the values in the shorter one.

c(1,2,3) + c(10,20,30,40,50,60)
[1] 11 22 33 41 52 63

 

But if the longer length is not a clean multiple of the shorter one, R will give you a warning (but still give you an answer).

c(1,2,3,4) + c(10,20,30,40,50,60)
Warning in c(1, 2, 3, 4) + c(10, 20, 30, 40, 50, 60): longer object length is
not a multiple of shorter object length
[1] 11 22 33 44 51 62
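
One way to picture recycling is to stretch the shorter vector yourself. This is only a sketch of the idea using the base function rep_len(); the names short, long, and stretched are made up.

```r
short <- c(1, 2, 3)
long  <- c(10, 20, 30, 40, 50, 60)

# rep_len() repeats `short` until it reaches the requested length
stretched <- rep_len( short, length(long) )
stretched

# Adding the stretched vector gives the same answer as letting R recycle
identical( short + long, stretched + long )
```

This is also why something like z * 2 works: the single value 2 is recycled across every element of z.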

Matrices

2-Dimensional Vectors

For some mathematical operations, we need to work with matrices. These are another ‘general’ container but with dimensions for rows and columns of data.

matrix( 1:9, ncol=3 )
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
matrix( LETTERS[1:9], nrow=3)
     [,1] [,2] [,3]
[1,] "A"  "D"  "G" 
[2,] "B"  "E"  "H" 
[3,] "C"  "F"  "I" 

2-Dimensional Vectors

Matrices are filled column-wise by default. If you want them filled row-wise, you have to ask for it with byrow = TRUE.

matrix( 1:9, ncol=3 )
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
matrix( 1:9, ncol=3, byrow = TRUE )
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Indices

Just like vectors, the square brackets are used to access values within a matrix. However, there are now two indices, one for the row and one for the column.

 

X <- matrix( 1:9, ncol=3, byrow = TRUE )
X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
X[1,3] <- 42
X
     [,1] [,2] [,3]
[1,]    1    2   42
[2,]    4    5    6
[3,]    7    8    9

Slicing

You can get an entire row or column using what is called a slice index.

X
     [,1] [,2] [,3]
[1,]    1    2   42
[2,]    4    5    6
[3,]    7    8    9

 

X[,2]
[1] 2 5 8
X[1,]
[1]  1  2 42
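
Notice that a slice comes back as a plain vector. If you need it to stay a matrix, the drop argument of the square brackets can help. A small sketch, rebuilding the same X by hand (the names col2, col2_matrix, and top_rows are made up):

```r
X <- matrix( c(1, 2, 42, 4, 5, 6, 7, 8, 9), ncol = 3, byrow = TRUE )

# A single column is simplified to a vector
col2 <- X[, 2]
is.matrix( col2 )

# drop = FALSE keeps the result as a 3 x 1 matrix
col2_matrix <- X[, 2, drop = FALSE]
dim( col2_matrix )

# A range of indices slices several rows at once
top_rows <- X[1:2, ]
dim( top_rows )
```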

Matrix Operators

Arithmetic operators on matrices work the same way, element by element (as long as the matrices have the same number of rows and columns).

 

X <- matrix( 1:4, ncol=2)
X
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Y <- matrix( c(3,5,7,9), ncol=2 )
Y
     [,1] [,2]
[1,]    3    7
[2,]    5    9

Binary Operators

X + Y
     [,1] [,2]
[1,]    4   10
[2,]    7   13
X * Y
     [,1] [,2]
[1,]    3   21
[2,]   10   36

This is element-wise multiplication (also known as the Hadamard product), not matrix multiplication.

Matrix Multiplication

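Matrix multiplication is a bit more involved: each entry of the result is the sum of the products of a row of the first matrix with a column of the second. In R it has its own operator, %*%.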

 

X %*% Y 
     [,1] [,2]
[1,]   18   34
[2,]   26   50
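
To see where those numbers come from, here is one entry worked out by hand. This is just a sketch using the X and Y from above; by_hand and product are made-up names.

```r
X <- matrix( 1:4, ncol = 2 )
Y <- matrix( c(3, 5, 7, 9), ncol = 2 )

# Entry [1,2] is row 1 of X times column 2 of Y, summed up
by_hand <- sum( X[1, ] * Y[, 2] )
by_hand

# The %*% operator does this for every row and column pair
product <- X %*% Y
product[1, 2]

# Unlike element-wise *, the order of the matrices matters
identical( X %*% Y, Y %*% X )
```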

Lists

Lists

Lists are more versatile containers in that they allow you to store different kinds of data in them.

By default, they are numerically indexed.

lst <- list( "Bob", 32, TRUE )
lst
[[1]]
[1] "Bob"

[[2]]
[1] 32

[[3]]
[1] TRUE

Double Square Brackets

Notice that lists use two sets of square brackets instead of one, to differentiate them from a normal vector.

lst[[1]] 
[1] "Bob"
lst
[[1]]
[1] "Bob"

[[2]]
[1] 32

[[3]]
[1] TRUE

Why The Double Brackets?

This is because single brackets applied to a list return another list (here, a list holding just the first element), whereas double brackets return the element stored inside it.

c( class(lst), class(lst[1]), class(lst[[1]]) )
[1] "list"      "list"      "character"
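
The difference matters as soon as you try to do something with the value. A short sketch with the same lst (sub_list and value are made-up names):

```r
lst <- list( "Bob", 32, TRUE )

# Single brackets: still a list, just a shorter one
sub_list <- lst[2]
is.list( sub_list )

# Double brackets: the number itself, ready for arithmetic
value <- lst[[2]]
value + 1

# Single brackets can also pull out several entries at once
length( lst[ c(1, 3) ] )
```

Trying sub_list + 1 would fail, because R does not do arithmetic on a list.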

Named Lists

Lists can be made more friendly to you by using actual names for the keys associated with each value. In some languages, like Python, these are referred to as dictionaries.

info <- list("Name" = "Bob", "Age" = 42)
info
$Name
[1] "Bob"

$Age
[1] 42

Notice the use of the $ in the output.

Named Lists

This $ notation is used to easily grab the contents of the list at that slot.

 

info$Name <- "Robert"
info
$Name
[1] "Robert"

$Age
[1] 42

Named Lists

The $ notation can also be used to add new entries to the list directly.

 

info$PassedDyersClass <- TRUE
info
$Name
[1] "Robert"

$Age
[1] 42

$PassedDyersClass
[1] TRUE

Square Brackets Also Work

You can also use the double brackets AND the name of the key as a reference.

info[["Name"]]
[1] "Robert"

However this is even more work and looks a bit less elegant than the $ notation. Also, if you look at the order of operations, you’ll see that the $ notation has a higher precedence in operations than the single or double brackets (see ?Syntax).
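
There is one situation where the double brackets are worth the extra typing: when the name of the key is itself stored in a variable. A small sketch (the variable key is made up for illustration):

```r
info <- list( "Name" = "Robert", "Age" = 42 )

# names() lists the keys that are available
names( info )

# Double brackets look up whatever name the variable holds
key <- "Age"
info[[ key ]]

# The $ notation takes the name literally, and there is no slot called "key"
info$key
```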

Lists are Ubiquitous

In R, you will most likely work with list objects as analysis results rather than as a container to keep your data. Almost all analyses return their values as a list with the included components. Here is an example, using the built-in iris data set summarized below.

summary( iris )
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Correlation

Here is a quick correlation between the sepal and petal lengths in the iris data set.

iris.test <- cor.test( iris$Sepal.Length, iris$Petal.Length )
iris.test

    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538 

Packaging of Results

is.list( iris.test )
[1] TRUE
class( iris.test )
[1] "htest"

What is hidden inside?

                                                        Values
statistic.t                                   21.6460193457598
parameter.df                                               148
p.value                                   1.03866741944978e-47
estimate.cor                                 0.871753775886583
null.value.correlation                                       0
alternative                                          two.sided
method                    Pearson's product-moment correlation
data.name              iris$Sepal.Length and iris$Petal.Length
conf.int1                                    0.827036329664362
conf.int2                                    0.905508048821454
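
Since the result is a list, each of those pieces can be pulled out with the $ notation. A minimal sketch using the same test as above:

```r
iris.test <- cor.test( iris$Sepal.Length, iris$Petal.Length )

# The slots that are available
names( iris.test )

# Individual components, by name
iris.test$estimate
iris.test$p.value
iris.test$conf.int
```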

Custom Printing

Printing the result shows the components of the analysis in a way that makes sense because, while it is a list, it also carries the class htest, and R prints objects of that class with a dedicated method.

iris.test

    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538 
class( iris.test )
[1] "htest"

Benefits

This is awesome because it makes it much easier to use something like the values stored in iris.test to insert the data from our analyses directly (inline) into our text.

There was a significant relationship between sepal and petal length (Pearson’s product-moment correlation, \(\rho =\) 0.872, \(t =\) 21.6, P = 1.04e-47).
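
How the sentence above was generated is not shown here, but a rough sketch of the idea is to format each component and paste it into the text. In an R Markdown or Quarto document this kind of expression would typically sit in inline code. The names r_txt, t_txt, p_txt, and sentence are made up.

```r
iris.test <- cor.test( iris$Sepal.Length, iris$Petal.Length )

# Round each piece to three significant digits for display
r_txt <- format( unname( iris.test$estimate ),  digits = 3 )
t_txt <- format( unname( iris.test$statistic ), digits = 3 )
p_txt <- format( iris.test$p.value,             digits = 3 )

# Assemble the pieces into a sentence fragment
sentence <- paste0( "rho = ", r_txt, ", t = ", t_txt, ", P = ", p_txt )
sentence
```

If the data change, the text changes with them, and there is no copy-and-paste step in which to make a mistake.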

Data Frames

The Lingua Franca of Data Analysis

The main container for almost all of your data is the data.frame.

  • Similar to a spreadsheet
  • Each column has the same data type (e.g., weight, longitude, survived)
  • Each row has all the observations for a given entity.

Example

Let's consider the following data as individual vectors.

names <- c("Bob","Alice","Jane")
homework.1 <- c(0.78, 0.95, 0.82)
homework.2 <- c(NA, 0.89, 0.92)

Example

These can be put into a data.frame as:

gradebook <- data.frame( names, homework.1, homework.2 )
gradebook
  names homework.1 homework.2
1   Bob       0.78         NA
2 Alice       0.95       0.89
3  Jane       0.82       0.92

Example

Each column in a data.frame is a self-contained set of data all of the same type and as such can be summarized.

summary( gradebook )
    names             homework.1      homework.2    
 Length:3           Min.   :0.780   Min.   :0.8900  
 Class :character   1st Qu.:0.800   1st Qu.:0.8975  
 Mode  :character   Median :0.820   Median :0.9050  
                    Mean   :0.850   Mean   :0.9050  
                    3rd Qu.:0.885   3rd Qu.:0.9125  
                    Max.   :0.950   Max.   :0.9200  
                                    NA's   :1       

Named Columns

Just like in a list, the columns of a data.frame are accessed by their names, and we can use the $ notation.

names(gradebook)
[1] "names"      "homework.1" "homework.2"
gradebook$homework.1
[1] 0.78 0.95 0.82
is.na( gradebook$homework.2 )
[1]  TRUE FALSE FALSE

Indexing of Elements

The easiest way to index values in a data.frame is to use the $ notation to grab the column (as a vector object) and then to use the square brackets to access a specific element.

gradebook$homework.2[1]
[1] NA
gradebook$homework.2[1] <- 0.85
gradebook
  names homework.1 homework.2
1   Bob       0.78       0.85
2 Alice       0.95       0.89
3  Jane       0.82       0.92
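
Because each column is a vector, everything from the vector section still applies. Here is a sketch, rebuilding the gradebook as it looks after the change above; the names above and average are made up.

```r
gradebook <- data.frame(
  names      = c("Bob", "Alice", "Jane"),
  homework.1 = c(0.78, 0.95, 0.82),
  homework.2 = c(0.85, 0.89, 0.92)
)

# A comparison on a column gives one TRUE or FALSE per row
above <- gradebook$homework.1 > 0.8
above

# That logical vector can be used as an index into another column
gradebook$names[ above ]

# Binary operators on columns work element by element,
# and $ can add the result as a new column
gradebook$average <- ( gradebook$homework.1 + gradebook$homework.2 ) / 2
gradebook
```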

Indexing of Elements

You can also use the numerical indices for both row and column in the data.frame (n.b., it is row first then column).

gradebook
  names homework.1 homework.2
1   Bob       0.78       0.85
2 Alice       0.95       0.89
3  Jane       0.82       0.92

 

gradebook[2,1]
[1] "Alice"
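
The slicing trick from matrices works here too, and column names can stand in for column numbers. A small sketch on the same gradebook, rebuilt here so the example runs on its own:

```r
gradebook <- data.frame(
  names      = c("Bob", "Alice", "Jane"),
  homework.1 = c(0.78, 0.95, 0.82),
  homework.2 = c(0.85, 0.89, 0.92)
)

# Leave the column index empty to get an entire row (still a data.frame)
gradebook[2, ]

# Leave the row index empty to get an entire column (as a vector)
gradebook[, 2]

# A column name is often easier to read than its position
gradebook[2, "homework.1"]
```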

Dimensions

The dimensions of a data.frame (its number of rows and columns) are often useful to know.

dim( gradebook )
[1] 3 3
nrow( gradebook )
[1] 3
ncol( gradebook )
[1] 3

External Data

You will almost never create data.frame objects de novo but instead load data in from some external resource. There are several functions that simplify this within tidyverse so let’s make sure we have it loaded into memory.

 

library( tidyverse )

Example Data

Here is a CSV file that is contained in this repository. Since it is a public repository, we can access it from within GitHub using a URL.

url <- "https://raw.githubusercontent.com/DyerlabTeaching/Data-Containers/main/data/arapat.csv"
beetles <- read_csv( url )
Rows: 39 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Stratum
dbl (2): Longitude, Latitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
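
The message above mentions specifying the column types. To show the idea without needing a network connection, here is a sketch that uses base R's read.csv() on a small piece of text rather than read_csv() on the URL. The two rows are the rounded values from the first rows of the printed data, not the full file, and the names csv_text and sample_rows are made up.

```r
# Two rows typed in by hand, standing in for the real file
csv_text <- "Stratum,Longitude,Latitude
88,-114,29.3
9,-114,29.0"

# colClasses states the type of each column up front, so the Stratum
# labels are kept as text instead of being guessed as numbers
sample_rows <- read.csv(
  text       = csv_text,
  colClasses = c("character", "numeric", "numeric")
)

str( sample_rows )
```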

Example Data

Araptus attenuatus

Data

dim( beetles )
[1] 39  3
head( beetles )
# A tibble: 6 × 3
  Stratum Longitude Latitude
  <chr>       <dbl>    <dbl>
1 88          -114.     29.3
2 9           -114.     29.0
3 84          -114.     29.0
4 175         -113.     28.7
5 177         -114.     28.7
6 173         -113.     28.4

Example Data

beetles %>%
  leaflet::leaflet() %>%
  leaflet::addProviderTiles(provider = leaflet::providers$Esri.WorldTopoMap) %>%
  leaflet::addMarkers( ~Longitude, ~Latitude,popup = ~Stratum )

Questions

If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.
